Introduction and Problem Statement

Harvard and MIT founded edX, a major online education provider in the United States. It offers university-level courses in a variety of fields to a global student base, with some courses available for free. With over 3000 courses and over 1.4 million certifications granted to students, Harvard and MIT have played an instrumental role in developing a thriving market for college-level content. Using the Kaggle dataset provided for this Hackathon challenge, we set out to infer a few details about this large domain and to build visualizations that better showcase the content of the dataset. We also worked with several libraries that were new to us, such as zoo, which supports calculations on time series of numeric vectors, matrices and factors, and Highcharter, a very flexible and customizable charting library for R.

The visualizations in this report are a blend of charts and plots built from dataset columns such as gender, course subject, course launch date and audited courses. The first section covers the course subjects offered by Harvard and MIT and the count of %Audited by institution and course subject. The second section interprets course subjects against various factors, using box plots, heat maps and highcharts to show the trends. The third section covers the count of participants (course content accessed), which we have visualized by course, by year and by students who accessed 50% of the course. We have also visualized the density of the gender ratio in a particular course.

Q1: What are the courses offered by Harvard and MIT? Which ones are the most offered courses from each of these universities?

Output

The two highcharts here show the number of courses offered by MIT and Harvard respectively, broken down by category. We can see that almost 50% of MIT's courses fall under Science, Technology, Engineering and Mathematics, with minimal courses under Humanities, History, Design, Religion and Education. Harvard, on the other hand, mostly offers Humanities, History, Design, Religion and Education courses.
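The per-institution subject shares described above can be computed with a simple cross-tabulation before any charting. The following is a minimal sketch in base R, assuming hypothetical column names (Institution, Course.Subject) and a small made-up data frame in place of the Kaggle file; the counts are illustrative only and are not the dataset's values.

```r
# Toy stand-in for the Kaggle data; column names and rows are assumptions.
courses <- data.frame(
  Institution = c("MIT", "MIT", "MIT", "MIT", "Harvard", "Harvard", "Harvard"),
  Course.Subject = c("STEM", "STEM", "Computer Science", "Humanities",
                     "Humanities", "Humanities", "STEM"),
  stringsAsFactors = FALSE
)

# Number of courses per subject (rows) and institution (columns).
subject_counts <- table(courses$Course.Subject, courses$Institution)

# Share of each institution's courses by subject (each column sums to 1).
subject_share <- prop.table(subject_counts, margin = 2)

# Most offered subject for each institution.
top_subject <- apply(subject_counts, 2, function(col) names(which.max(col)))

print(round(subject_share, 2))
print(top_subject)
```

A table of this shape is what a grouped column chart would plot, with one bar per subject within each institution.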

Q2: What is the count of %Audited according to Institutions and Course subjects?

Output

The first stacked bar plot shows that the courses offered by MIT account for a higher count of %Audited than those offered by Harvard. Similarly, the next bar plot shows the different course subjects and the percentage of each that is audited. Science, Technology, Engineering and Mathematics courses are the most audited. This suggests that students are more inclined towards Science and Technology courses. These institutions should encourage students to study other courses too.
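The comparison above reduces to a grouped summary of the audited percentage. Here is a minimal sketch in base R, assuming a hypothetical Audited.Pct column and made-up values. It aggregates with the mean, which is an assumption about how the bars were built rather than the method used in the report.

```r
# Toy data; column names and values are assumptions for illustration.
courses <- data.frame(
  Institution = c("MIT", "MIT", "MIT", "Harvard", "Harvard", "Harvard"),
  Course.Subject = c("STEM", "STEM", "Computer Science",
                     "Humanities", "Humanities", "STEM"),
  Audited.Pct = c(20, 30, 10, 5, 15, 25),
  stringsAsFactors = FALSE
)

# Mean audited percentage per institution.
by_institution <- aggregate(Audited.Pct ~ Institution, data = courses, FUN = mean)

# Mean audited percentage per subject, highest first.
by_subject <- aggregate(Audited.Pct ~ Course.Subject, data = courses, FUN = mean)
by_subject <- by_subject[order(-by_subject$Audited.Pct), ]

print(by_institution)
print(by_subject)
```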

Q3: Which course subjects are certified, and how many of them are posted on the forum? What are the total course hours for the subjects? What are the median hours for certification for courses?

Output

We have used box plots to answer these questions. The first box plot shows that Computer Science courses are the most certified and that Science, Technology, Engineering and Mathematics courses are among the least. The next box plot shows the percentage of courses posted on the forum, where Humanities, History, Design, Religion and Education are the highest. The third plot shows the total course hours for each course subject, where Computer Science is a clear winner. Finally, the last box plot summarizes the median hours for certification across all the course subjects. Science, Technology, Engineering and Mathematics and Computer Science are quite close in their results, but STEM courses require close to 50 hours more.
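The box plot summaries described above can also be read off numerically. The sketch below uses base R's boxplot() with plot = FALSE to return the five-number summaries per subject without drawing. The Median.Hours.Cert column name and all values are made up for illustration.

```r
# Toy data; column names and hours are assumptions, not dataset values.
courses <- data.frame(
  Course.Subject = rep(c("Computer Science", "STEM", "Humanities"), each = 3),
  Median.Hours.Cert = c(40, 60, 80, 90, 110, 130, 10, 20, 30),
  stringsAsFactors = FALSE
)

# plot = FALSE returns the statistics a box plot would draw;
# drop the argument to render the plot itself.
bp <- boxplot(Median.Hours.Cert ~ Course.Subject, data = courses, plot = FALSE)

# Row 3 of bp$stats holds the median of each group, in the order of bp$names.
medians <- setNames(bp$stats[3, ], bp$names)
print(medians)
```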

Q6: What is the trend of the course subjects accessed across the years?

Output

To answer this business question we have used an alluvial chart to show the trend of the course subjects accessed across the years. We can see a good blend in the courses accessed from both MIT and Harvard. However, courses from Computer Science and Technology, Engineering and Mathematics are mostly preferred from MIT, whereas Government, Health and Social Science and History, Design, Religion and Education are mostly preferred from Harvard.
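An alluvial chart like the one described above is drawn from a frequency table with one row per combination of year, institution and subject. The following is a minimal sketch of that preparation step in base R, assuming a hypothetical Launch.Date column in month/day/year format and made-up rows. It uses as.Date() instead of Lubridate to stay dependency-free.

```r
# Toy data; column names, date format and rows are assumptions.
courses <- data.frame(
  Institution = c("MIT", "MIT", "Harvard", "Harvard", "MIT"),
  Course.Subject = c("Computer Science", "STEM", "Humanities",
                     "Humanities", "Computer Science"),
  Launch.Date = c("09/05/2012", "02/12/2013", "10/15/2012",
                  "03/02/2013", "01/20/2013"),
  stringsAsFactors = FALSE
)

# Extract the launch year from the date string.
launch <- as.Date(courses$Launch.Date, format = "%m/%d/%Y")
courses$Year <- as.integer(format(launch, "%Y"))

# Count courses for every year / institution / subject combination,
# then keep only the combinations that actually occur.
flows <- as.data.frame(
  table(Year = courses$Year,
        Institution = courses$Institution,
        Subject = courses$Course.Subject),
  stringsAsFactors = FALSE
)
flows <- flows[flows$Freq > 0, ]
print(flows)
```

A long-form frame like flows, with a frequency column and one column per axis, is the kind of input that ggalluvial's axis aesthetics expect.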

Q7: Find out the relationship between audited courses and the participants who have accessed the course content.

Output

The scatter plot here shows the relationship between two variables, Participants and Audited courses. The results show a moderately strong positive relationship between the two.
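The strength of a relationship like the one in the scatter plot can be quantified with a correlation coefficient and a fitted line. This is a minimal sketch in base R with made-up values. The column names Participants and Audited are assumptions, and the numbers do not come from the dataset.

```r
# Toy data; column names and values are assumptions for illustration.
courses <- data.frame(
  Participants = c(1000, 2500, 4000, 6000, 9000, 12000),
  Audited = c(150, 500, 450, 1300, 1100, 2600)
)

# Pearson correlation between participants and audited counts.
r <- cor(courses$Participants, courses$Audited)

# Simple linear fit; a positive slope means audited counts rise with participants.
fit <- lm(Audited ~ Participants, data = courses)
slope <- coef(fit)[["Participants"]]

print(r)
print(slope)
```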

Conclusion

During this Hackathon we worked as a team on the Kaggle dataset Online Courses by MIT and Harvard. The aim was to create clear and concise interpretations of the data, to better understand it and all its attributes, and to improve our data visualization skills in R. We used several libraries in the course of this assignment: ggalluvial, which uses geom alluvial to create alluvial charts; Highcharts in R, to create interactive and dynamic charts; and Lubridate, to make it easy to work with dates.

Some of the key findings from the dataset are that almost 50% of MIT's courses fall under Science, Technology, Engineering and Mathematics, with minimal courses under Humanities, History, Design, Religion and Education, whereas Harvard offers more than 50% of its courses under the latter. The alluvial chart shows that courses from Computer Science and Technology, Engineering and Mathematics are mostly preferred from MIT, whereas Government, Health and Social Science and History, Design, Religion and Education are mostly preferred from Harvard. We also have a heatmap, which is another way to visualize hierarchical clustering and, in our case, produces a few useful insights on the course hours of all the course subjects from 2012 to 2016. In conclusion, this assignment has better equipped us to choose the right kind of chart for the data being compared or correlated. It has also familiarized us with the vast range of color palettes that can be used in R to beautify our plots and graphs.

From our visualizations, we draw the following interpretations:

Female students should be encouraged to study Science and Technology courses.

Harvard should focus on developing Science and Technology courses, as students are choosing MIT over Harvard for these. MIT, in turn, should focus on Government, History and Design courses.

Many students access course content but do not complete the course. Institutions should focus on helping students finish the courses they start.